Motivate the choice of learning/using R for scientific quantitative analysis, and lay out some fundamental concepts in biostatistics with concrete R coding examples.
Lecture 1: topics
Introduction to R and R-studio
Why R?
Principles of reproducible analysis with R + RStudio
R objects, functions, packages
Understanding different types of variables
Principles of “tidy data”
Descriptive statistics
Measures of central tendency, measures of variability (or spread), and frequency distribution
Visual data exploration
{ggplot2}
INTRO TO R AND RSTUDIO
R version
If you have previously installed R on your machine, you can check which version you are running by executing this command in R:
R is available for free for Windows , GNU/Linux , and macOS .
To install R, go to this link https://cloud.r-project.org/. The latest available release is R 4.3.3 “Angel Food Cake” released on 2024-02/29, but any (fairly recent) version will do.
Install RStudio IDE
RStudio Desktop is an Integrated Development Editor (IDE), basically a graphical interface wrapping and interfacing R (which needs to be installed first).
Besides RStudio, R (which is a command line driven program) can be executed:
via its native interface (R GUI)
from many other code editors, like VS Code, Sublime Text, Jupyter Notebook
An R Project will keep all the files associated with a project (including invisible ones!) organized together – input data, R scripts, analytical results, figures. Besides being common practice, this has the advantage of implicitly setting the “working directory”, which is incredibly important when you need to load or output files, specifying their file path.
In Figure 1 you can see how easy it is just following RStudio prompts:
Create a new directory for each project
Select parent folder
Creating an R Project [in Rstudio] (cont.)
Figure 1: Creating an R project
Install R packages from CRAN (stable version)
An R package* is a shareable bundle of functions. Besides the basic built-in functions already contained in the program (i.e. the base package), many useful R functions come in free libraries of code (or packages) written by R’s users. You can find them in different repositories:
To install a package use utils function install.packages("package_name)
# Installing (ONLY the 1st time)utils::install.packages('here')# OR (same)install.packages('here')
Install R packages RStudio pane
In alternative, you can install/update packages using the Packages tab on the lower right pane of RStudio.
Screenshot Install/Update pckgs from RStudio
Install R packages from GitHub (testing version)
Use install_github from the package devtools. EXAMPLE: let’s install a little package paint (which colors the structure of dataset when printing).
Code
# Installing devtools (ONLY the 1st time)utils::install.packages('devtools')# Installing paint from GitHub library(devtools)devtools::install_github("MilesMcBain/paint")# test paint outlibrary(paint)
Output {paint} function
# Structure of a data.frame paint::paint(mtcars)
Use R Packages
We will be using {base} & {utils} (pre-installed and pre-loaded)
We will also use the packages below (specifying package::function for clarity).
# Load them for this R sessionlibrary(here) # tools find your project's files, based on working directorylibrary(janitor) # tools for examining and cleaning datalibrary(skimr) # tools for summary statistics library(dplyr) # {tidyverse} tools for manipulating and summarising tidy data library(ggplot2) # {tidyverse} tools for plottinglibrary(forcats) # {tidyverse} tool for handling factorslibrary(ggridges) # alternative to plot density functions library(fs) # file/directory interactions
Help on R package/function
To inquire about a package and/or its functions, you can again write in your console ?package_name or ??package_name and RStudio will open up the Help tab in the lower right pane.
# Opening Help page on package/function?here??here
File paths logistics
It is never good practice to “hard code” the file’s absolute path: most likely it will break your code as soon as you (or someone else) need to run it on a different computer, let alone within a different OS.
Let’s look at this example code using function readr::read_csv() (which reads a *.csv data file into the R workspace)
# [NOT REPRODUCIBLE] hard coding your file path -----------------------# File path on Mac:dataset <- readr::read_csv("/Users/testuser/R4biostats/input_data/dataset.csv")# Same file path on Windows:dataset <- readr::read_csv("C:\Users\testuser\R4biostats\input_data\dataset.csv")
🙄 …it won’t work on any other computer since it won’t have that same file structure!
(Reproducible) file paths with here (in Rstudio)
The here package lets you reference file paths in a reproducible manner (anchored on the R Project’s folder as the root).
Where is my Working Directory?
here::here()
You should get: “/Users/YourName/RProj_Dir”
Now, you can embed here(dir,subdir) specifications in other functions.
For example, create sub-directories (for saving input data and output data) with the fs package
## --- [check the function documentation]?fs::dir_create# with `here` I simply add subfolder names relative to my wd fs::dir_create(here("practice", "data","data_input"))# ...and a subfolder to put output files at the endfs::dir_create(here("practice", "data","data_output"))## --- [if I need to remove it (I have them already)]fs::dir_delete(here("practice", "data"))
R OBJECTS, FUNCTIONS, PACKAGES
Importing data into R workspace
We are using real data provided by Thabtah,Fadi. (2017). Autism Screening Adult. UCI Machine Learning Repository. https://doi.org/10.24432/C5F019
autism_data_url <-read.csv(file ="https://raw.githubusercontent.com/Sydney-Informatics-Hub/lessonbmc/gh-pages/_episodes_rmd/data/autism_data.csv", header =TRUE, # 1st line is the name of the variablessep =",", # which is the field separator character.na.strings =c("?") # specific values R should interpret as NA)
Option 2: Importing from my folder (if you previously downloaded the file)
here lets me specify the complete path of the destination folder
Tip
Make sure to match your own folder structure the file path here(...)!
# Check my working directory location# here::here()# Use `here` in specifying all the subfolders AFTER the working directory autism_data_file <-read.csv(file =here("practice", "data_input", "01_datasets", "autism_data.csv"), header =TRUE, # 1st line is the name of the variablessep =",", # which is the field separator character.na.strings =c("?"),# specific values R should interpret as NArow.names =NULL)
DATA OBSERVATION & MANIPULATION
Viewing the dataset and variables
View(autism_data_file)
Or click on the object in Environment tab (upper right pane of RStudio)
# What data type is this data?class(autism_data_file)
[1] "data.frame"
# What variables are included in this dataset?base::colnames(autism_data_file)
[1] "White-European" "Latino" "Latino" "White-European"
[5] NA "Others"
Option 2b Extract rows with [#row,]
# Indexing to pick `[#row, ]` head(autism_pids[1 , ] ) # empty cols means all
id A1_Score A2_Score A3_Score A4_Score A5_Score A6_Score A7_Score A8_Score
1 1 1 1 1 1 0 0 1 1
A9_Score A10_Score age gender ethnicity jaundice autism contry_of_res
1 0 0 26 f White-European no no United States
used_app_before result age_desc relation Class_ASD pids
1 no 6 18 and more Self NO PatientID_1
head(autism_pids[50,])
id A1_Score A2_Score A3_Score A4_Score A5_Score A6_Score A7_Score A8_Score
50 50 1 1 0 0 0 1 1 1
A9_Score A10_Score age gender ethnicity jaundice autism contry_of_res
50 0 1 30 f Asian no no Bangladesh
used_app_before result age_desc relation Class_ASD pids
50 no 6 18 and more Self NO PatientID_50
head(autism_pids[25:26 ,])
id A1_Score A2_Score A3_Score A4_Score A5_Score A6_Score A7_Score A8_Score
25 25 1 1 1 1 0 0 0 1
26 26 0 1 1 0 0 0 0 1
A9_Score A10_Score age gender ethnicity jaundice autism contry_of_res
25 0 0 43 m <NA> no no Lebanon
26 0 0 24 f <NA> yes no Afghanistan
used_app_before result age_desc relation Class_ASD pids
25 no 5 18 and more <NA> NO PatientID_25
26 no 3 18 and more <NA> NO PatientID_26
Option 3 Extract rows & cols with [#row,#col]
# Indexing to pick `[#row, #col]` autism_pids[1:3,1]
[1] 1 2 3
autism_pids[1:3,23]
[1] "PatientID_1" "PatientID_2" "PatientID_3"
autism_pids[1:3,2]
[1] 1 1 1
autism_pids[1:3,14]
[1] "White-European" "Latino" "Latino"
What are the data types of the variables?
Option 1 using base functions
on the whole dataset
# What are the data types of the variables? ---------------------------------str(autism_pids) # integer and character
#### char 2 factor -------------------------------------------------------------# Say I want to treat some variables as factorsautism_pids$gender <-as.factor(autism_pids$gender)autism_pids$ethnicity <-as.factor(autism_pids$ethnicity)autism_pids$contry_of_res <-as.factor(autism_pids$contry_of_res)autism_pids$relation <-as.factor(autism_pids$relation)# check class(autism_pids$gender)
[1] "factor"
class(autism_pids$ethnicity)
[1] "factor"
class(autism_pids$contry_of_res)
[1] "factor"
class(autism_pids$relation)
[1] "factor"
From character to factor using base R (n cols)
autism_pids_temp <- autism_pids # copy df for test to_factor <-c("gender", "ethnicity", "contry_of_res", "relation") # vector of col names autism_pids_temp[ ,to_factor] <-lapply(X = autism_pids[ ,to_factor], FUN = as.factor)# check class(autism_pids_temp$gender)
I may prefer to code a variable as logical. For example, age_desc may be more explicit if coded as logical.
I create a new column age_desc_log
# observe a subset of some columns autism_subset <- autism_pids [1:5, c("gender","jaundice", "autism","age_desc","Class_ASD","pids")]# View(autism_subset)# recode "age_desc" as LOGICAL new var "age_desc_log"autism_pids$age_desc_log <-ifelse(autism_pids$age_desc =="18 and more", TRUE, FALSE )class(autism_pids$age_desc)
[1] "character"
class(autism_pids$age_desc_log)
[1] "logical"
From character to dummy [0,1]
I also may need binary variables expressed as [0,1] (e.g. to incorporate nominal variables into regression analysis). Let’s recode autism.
head or tail return the first or last parts of an object
head(autism_pids) #return fist 6 obstail(autism_pids) #return last 6 obs
using head or tail from utils (cont.)
head(autism_pids, n =2) #return fist 2 obs
id A1_Score A2_Score A3_Score A4_Score A5_Score A6_Score A7_Score A8_Score
1 1 1 1 1 1 0 0 1 1
2 2 1 1 0 1 0 0 0 1
A9_Score A10_Score age gender ethnicity jaundice autism contry_of_res
1 0 0 26 f White-European no no United States
2 0 1 24 m Latino no yes Brazil
used_app_before result age_desc relation Class_ASD pids
1 no 6 18 and more Self NO PatientID_1
2 no 5 18 and more Self NO PatientID_2
age_desc_log autism_dummy
1 TRUE 0
2 TRUE 1
tail(autism_pids, n =2) #return last 2 obs
id A1_Score A2_Score A3_Score A4_Score A5_Score A6_Score A7_Score A8_Score
703 703 1 0 0 1 1 0 1 0
704 704 1 0 1 1 1 0 1 1
A9_Score A10_Score age gender ethnicity jaundice autism contry_of_res
703 1 1 35 m South Asian no no Pakistan
704 1 1 26 f White-European no no Cyprus
used_app_before result age_desc relation Class_ASD pids
703 no 6 18 and more Self NO PatientID_703
704 no 8 18 and more Self YES PatientID_704
age_desc_log autism_dummy
703 TRUE 0
704 TRUE 0
Investigating a subset of observations
E.g. I learned that some patients have missing age… how many are they?
# run...sum(is.na(autism_pids$age))
[1] 2
# or skimr::n_missing(autism_pids$age)
[1] 2
So, next, I want to ID those patients with missing age.
New df (patients missing age) as SUBSET of the given df
I want to extract only the obs (rows) of interest with a few useful vars (cols)
pids age autism_dummy
63 PatientID_63 NA 0
92 PatientID_92 NA 0
New df (patients missing age) as SUBSET of the given df
I want to extract only the obs (rows) of interest with a few useful vars (cols)
Option 3) using subset from base
# arguments allow me to specify rows and cols missing_age_subset3 <-subset(x = autism_pids, subset =is.na(autism_pids$age), # 1 logical conditionselect =c("pids", "age", "autism_dummy") # which cols ) missing_age_subset3
pids age autism_dummy
63 PatientID_63 NA 0
92 PatientID_92 NA 0
New df (filtering on 2 conditions) as SUBSET of the given df
Option 1) using base::subset
# Creates a SUBSET based on MORE conditions (`age` and `ethnicity`)twocond_base_subset <-subset(x = autism_pids, # 2 logical conditions subset = age <25& contry_of_res =="Brazil", # pick a few cols select =c("pids", "age", "contry_of_res","autism_dummy")) twocond_base_subset
pids age contry_of_res autism_dummy
2 PatientID_2 24 Brazil 1
54 PatientID_54 21 Brazil 1
94 PatientID_94 19 Brazil 1
429 PatientID_429 20 Brazil 0
587 PatientID_587 21 Brazil 0
588 PatientID_588 21 Brazil 0
New df (filtering on 2 conditions) as SUBSET of the given df
Option 2) using dplyr (filter + select)
Switching to the package dplyr and embracing the “pipe” (%>%) operator logic, in which the filtering (rows) and selecting (columns) is done in sequence
## here the filtering (rows) and selecting (columns) is done in sequencetwocond_dplyr_subset <- autism_pids %>% dplyr::filter(age <25& contry_of_res =="Brazil") %>%# which rows dplyr::select (pids, age, contry_of_res, autism_dummy) # which colstwocond_dplyr_subset
pids age contry_of_res autism_dummy
1 PatientID_2 24 Brazil 1
2 PatientID_54 21 Brazil 1
3 PatientID_94 19 Brazil 1
4 PatientID_429 20 Brazil 0
5 PatientID_587 21 Brazil 0
6 PatientID_588 21 Brazil 0
Dealing with missing data
Input values where missing
⚠️ WARNING︎: This is a very delicate & risky step ⚠️
any modified/imputed data (beyond the original collection) can affect subsequent analysis and statistical modeling
it will be necessary to document and justify whichever approach is used to deal with missing data.
Let’s assume we can get the missing data by cross-checking related clinical information
# 1/2 create a new variable autism_pids$age_inputed <- autism_pids$age# 2/2 replace value (presumably taken from other source) of `aged_inputed` # CONDITIONAL on `pids`autism_pids$age_inputed[autism_pids$pids =="PatientID_63"] <-65autism_pids$age_inputed[autism_pids$pids =="PatientID_92"] <-45# checkskimr::n_missing(autism_pids$age)
[1] 2
skimr::n_missing(autism_pids$age_inputed)
[1] 0
DESCRIPTIVE STATISTICS
Summarizing all variables
Try these 2 options:
base::c
summary(autism_pids)
skimr::skim
skimr::skim(autism_pids)
Notice summary different behavior according to the variable’s type
The function’s results depend on the class of the object
look at the output in case of integer (e.g. A1_Score)
summary(autism_pids$A1_Score) # min, max quartiles, mean, median
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0000 0.0000 1.0000 0.7216 1.0000 1.0000
look at the output in case of factor (e.g. ethnicity)
summary(autism_pids$ethnicity) # counts of levels' frequency (included NA!)
Asian Black Hispanic Latino Middle Eastern
123 43 13 20 92
others Others Pasifika South Asian Turkish
1 30 12 36 6
White-European NA's
233 95
Notice summary different behavior according to the variable’s type (cont.)
look at the output in case of logical (e.g. age_desc_log)
summary(autism_pids$age_desc_log) # counts of TRUE
Mode TRUE
logical 704
Frequency distributions with table
Frequency distributions can be used for nominal, ordinal, or interval/ration variables
0 1
Asian 118 5
Black 38 5
Hispanic 12 1
Latino 12 8
Middle Eastern 83 9
others 1 0
Others 28 2
Pasifika 10 2
South Asian 34 2
Turkish 5 1
White-European 183 50
Grouping and summarizing with base R
E.g. I want to know the average age of men and women sub-groups.
Option 1) using by
# by(data$column, data$grouping_column, mean)by(data = autism_pids$age_inputed, INDICES = autism_pids$gender, FUN = mean)
autism_pids$gender: f
[1] 29.60237
------------------------------------------------------------
autism_pids$gender: m
[1] 28.98365
Grouping and summarizing with base R
Option 2) using tapply
# i.e. apply a function to subsets of a vector or array, split by one or more factors.tapply(X = autism_pids$age_inputed, INDEX = autism_pids$gender, FUN = mean)
# A tibble: 2 × 4
gender mean_age N_obs N_with_autism
<fct> <dbl> <int> <int>
1 f 29.6 337 54
2 m 29.0 367 37
Measures of central tendency
Mean and median
Recall that:
Population MEAN\(\mu=\frac{\sum_{i=1}^n x_{i}}n\) Sample MEAN\(\bar{x}=\frac{\sum_{i=1}^n x_{i}}n\)
Sample MEDIAN
For uneven \(n\): \(Mdn = \frac{x_{(n+1)}}2\)
For even \(n\): \(Mdn = \frac{x_{(n/2)} + x_{(n/2+1)}}2\)
Mean/Median using base R
Important to specify the argument na.rm = TRUE or the functions won’t work
## Let's use `age` and `age_inputed` to see what inputed missing values did mean(autism_pids$age)
[1] NA
median(autism_pids$age)
[1] NA
mean(autism_pids$age, na.rm =TRUE)
[1] 29.20655
median(autism_pids$age, na.rm =TRUE)
[1] 27
mean(autism_pids$age_inputed)
[1] 29.27983
median(autism_pids$age_inputed)
[1] 27
Create custom function to calculate statistical mode 1/2
R doesn’t have a built-in function for the statistical mode, so we can create a custom one: f_calc_mode
Define the custom function
f_calc_mode <-function(x) { # `unique` returns a vector of unique values uni_x <-unique(x) # `match` returns the index positions of 1st vector against 2nd vector match_x <-match(x, uni_x)# `tabulate` count the occurrences of integer values in a vector. tab_x <-tabulate(match_x) # returns element of uni_x that corresponds to max occurrences uni_x[tab_x ==max(tab_x)]}
Create custom function to calculate statistical mode 2/2
Call the custom function
f_calc_mode(autism_pids$age)
[1] 21
f_calc_mode(autism_pids$age_inputed)
[1] 21
f_calc_mode(autism_pids$ethnicity)
[1] White-European
11 Levels: Asian Black Hispanic Latino Middle Eastern others ... White-European
Measures of variability (or spread)
Variance and Standard deviation
Recall that: Population Variance\(\sigma^2 = \frac{\displaystyle\sum_{i=1}^{n}(x_i - \mu)^2} {n}\)
Population Standard deviation\(\sigma = \sqrt\frac{\displaystyle\sum_{i=1}^{n}(x_i - \mu)^2} {n}\)
Sample Standard deviation\(s = \sqrt\frac{\sum{(x_i-\bar{x})^2}}{n-1}\)
Variance and Standard deviation using base R
Important to specify the argument na.rm = TRUE or the functions won’t work (or use the age_inputed variable)
var(autism_pids$age, na.rm =TRUE)
[1] 94.28966
var(autism_pids$age_inputed)
[1] 96.19328
sd(autism_pids$age, na.rm =TRUE)
[1] 9.710286
sd(autism_pids$age_inputed)
[1] 9.807817
VISUAL DATA EXPLORATION
Plotting with ggplot2
ggplot2 provides a set of tools to map data to visual elements on a plot, to specify the kind of plot you want, and then subsequently to control the fine details of how it will be displayed. It basically allows to build a plot layer by layer (Figure 2).
data -> specify what the dataset is
aesthetic mappings (or just aesthetics) -> specify which dataset’s variables will turn into the plot elements (e.g. \(x\) and \(y\) values, or categorical variable into colors, points, and shapes).
geom -> the overall type of plot, e.g. geom_point() makes scatterplots, geom_bar() makes barplots, geom_boxplot() makes boxplots.
Histograms split the data into ranges (bins) and show the number of observations in each. Hence, it’s important to pick widths that represents the data well.
The default value is 30
We can change it using the argument bins = #
autism_pids %>%ggplot(aes(x = age_inputed )) +# specify to avoid warning if we fail to specify the number of bins geom_histogram(bins=40) +theme_bw()
… define bin width
… add mean and std dev vertical lines
using geom_vline() to add a vertical line for the mean, and the range between -1 and +1 sd from the mean.
using annotate() for adding small annotations (such as text labels)
autism_pids %>%ggplot(aes(x = age_inputed )) +geom_histogram(bins=40) +# add mean vertical linegeom_vline(xintercept =mean(autism_pids$age_inputed),na.rm =FALSE,lwd=1,color="#9b2339") +# add annotations with the mean valueannotate("text", x =mean(autism_pids$age_inputed) *1.2, # coordinates for positioningy =mean(autism_pids$age_inputed) *2.5,label =paste("Mean =", round(mean(autism_pids$age_inputed), digits =2)),col ="#9b2339",size =4)+# add also sd +1 and -1 geom_vline(aes(xintercept =mean(autism_pids$age_inputed) +sd(autism_pids$age_inputed)), color ="#000000", size =1, linetype ="dashed") +geom_vline(aes(xintercept =mean(autism_pids$age_inputed) -sd(autism_pids$age_inputed)), color ="#000000", size =1, linetype ="dashed") +theme_bw()
using the specification position = 'dodge' inside geom_histogram()
# trying to improve readability autism_pids %>%ggplot(mapping =aes(x = age_inputed, fill = gender )) +# bars next to each other with `position = 'dodge'`geom_histogram(bins=40, position ='dodge') +scale_fill_manual(values =c("#e07689","#57b7ff")) +scale_color_manual(values =c("#9b2339","#005ca1")) +theme_bw()
… shifting bars by group
…facet by gender
That’s still not very easy to digest. Instead of only filling, you can separate the data into multiple plots to improve readability
adding facet_wrap() with the the specification of ~categ_var
also ncol = 1 requires the subplot to be in 1 column
autism_pids %>%ggplot(aes(x = age_inputed, fill = gender )) +geom_histogram(color="#e9ecef", alpha=0.8, position ='dodge') +theme_bw() +# splitting the gender groups, specifying `ncol` to see one above the otherfacet_wrap(~gender, ncol =1) +scale_fill_cyclical(values =c("#9b2339","#005ca1"))
…facet by gender
… adding 2 mean/median vert lines (by gender)
I want to see the mean vertical line for each of the subgroups, but in this case, I need to create a small dataframe of summary statistics (group_stats).
I do so by using dplyr add a column mean_age with the group mean
# A tibble: 2 × 3
gender mean_age median_age
<fct> <dbl> <dbl>
1 f 29.6 28
2 m 29.0 26
(Small digression on tidyr::pivot_longer)
The new small dataframe group_stats offers an example of reshaping, i.e. turning a table from a “wide” form (with each variable in its own column) to a “long” form (one column for both the measures names and another for both the measures values).
This can be done using tidyr::pivot_longer function, where these arguments must be specified:
# A tibble: 4 × 4
gender Stat Value label
<fct> <chr> <dbl> <chr>
1 f mean_age 29.6 f_mean_age
2 f median_age 28 f_median_age
3 m mean_age 29.0 m_mean_age
4 m median_age 26 m_median_age
…facet by gender + vert lines by group
Notice that now the plot will have 2 data sources:
autism_pids
group_stats_long
autism_pids %>%ggplot(aes(x = age_inputed, fill = gender)) +# geom_histogram from dataframe 1geom_histogram(bins=30,color="#e9ecef", alpha=0.8, position ='dodge') +facet_wrap(~gender, ncol =1) +scale_fill_manual(values =c("#9b2339","#005ca1")) +# geom_vline from dataframe 2geom_vline(data = group_stats_long, mapping =aes(xintercept = Value, color = Stat),lwd=1,linetype=1) +scale_color_manual(values =c( "#f0a441" , "#d8cf71")) +theme_bw()
…facet by gender + vert lines by group
… finishing touches
using labs() and theme() layers
hist_plot <- autism_pids %>%ggplot(aes(x = age_inputed, fill = gender)) +# geom_histogram from dataframe 1geom_histogram(bins=30,color="#e9ecef", alpha=0.8, position ='dodge') +facet_wrap(~gender, ncol =1) +scale_fill_manual(values =c("#9b2339","#005ca1")) +# geom_vline from dataframe 2geom_vline(data = group_stats_long, mapping =aes(xintercept = Value, color = Stat),lwd=1.5,linetype=6) +scale_color_manual(values =c( "#e68000", "#d8cf71")) +# increase number of x axis ticks scale_x_continuous(breaks =seq(10, 100,10 ), limits =c(10,70)) +# Additional theme details labs(x ="age brackets", y ="n of individuals",color ="Stats",title ="Distribution of observations by gender",subtitle ="",caption ="Source: Thabtah,Fadi (2017) https://doi.org/10.24432/C5F019.") +theme(legend.position ="right",plot.title =element_text(face ="bold")) hist_plot
… finishing touches
Density ggridges package
As an alternative, you can use the ggridges package to make ridge plots. The geom geom_density_ridges calculates density estimates from the provided data and then plots those, using the ridgeline visualization. In this case plots include a vertical median line.
autism_pids %>%# this takes also `y` = groupggplot(aes(x=age_inputed, y = gender, fill = gender)) + ggridges::geom_density_ridges() +# I can add quantile lines (2 is the median)stat_density_ridges(quantile_lines =TRUE, quantiles =c(0.5), alpha =0.75)+# increase number of x axis ticks scale_x_continuous(breaks =seq(10, 100,10 ), limits =c(16, 86)) +scale_fill_cyclical(values =c("#9b2339","#005ca1")) +theme_bw()
Density ggridges package
Barchart
Bar charts provide a visual presentation of categorical data, with geom_bar() (height of the bar proportional to the number of cases in each group)
autism_pids %>%ggplot(aes(x = ethnicity )) +geom_bar(fill ="steelblue") +# reference line geom_hline(yintercept=100, color ="#9b2339", size=0.5, ) +# labels, title, etc labs(x ="ethnicity", y ="n of individuals",color ="Stats",title ="Distribution of observations by ethnicity",subtitle ="",caption ="Autism study") +theme_bw() +# specification son axis labelstheme(axis.text.x =element_text(angle=50, vjust=0.75), axis.text.y =element_text(size=10,face="bold"))
…improve theme
…improve readability (reorder bars)
Reordering the bars by count using the package forcats and its function fct_infreq
(which we can do because ethnicity was recoded as factor)
autism_pids %>%# we modify our x like so ggplot(aes(x = forcats::fct_infreq(ethnicity ))) +geom_bar(fill ="steelblue") +geom_hline(yintercept=100, color ="#9b2339", size=0.5, ) +labs(x ="ethnicity", y ="n of individuals",color ="Stats",title ="Distribution of observations by ethnicity",subtitle ="",caption ="Autism study") +# --- wrap long x labels (flipped ) !!!# scale_x_discrete(labels = function(x) stringr::str_wrap(x, width = 10)) +theme_bw() +theme(axis.text.x =element_text(angle=50, vjust=0.75), axis.text.y =element_text(size=10,face="bold"))
…improve readability (reorder bars)
…improve readability (highlight NA)
Let’s highlight the fact that the last column (NA) represents missing values.
Create the highlight variable
Map color to a variable (fill = highlight)
…improve readability (highlight NA) code
autism_pids %>%## --- prep the dataframe dplyr::mutate(# Add a factor variable with two levelshighlight = forcats::fct_other(ethnicity, keep ="NA", other_level ="All Groups")) %>%## --- now plot # In `aes mapping` we map color to a variable (`fill = highlight`)ggplot(aes(x = forcats::fct_infreq(ethnicity), fill = highlight)) +geom_bar()+# Use custom color palettesscale_fill_manual(values=c("#0084e6")) +# Add a line at a significant level geom_hline(yintercept=100, color ="#9b2339", size=0.5, ) +theme_bw() +# make some more theme specifications labs(x ="ethnicity", y ="n of individuals",color ="Stats",title ="Distribution of observations by ethnicity",subtitle ="",caption ="Autism study") +theme(axis.text.x =element_text(angle=50, vjust=0.75), axis.text.y =element_text(size=10,face="bold")) +theme(legend.position ="none")
…improve readability (highlight NA) code
Boxplot
The boxplot is one of the simplest ways of representing a distribution of a continuous variable and it is packed with information. It consists of two parts:
Box — Extending from the 1st to the 3rd quartile (Q1 to Q3) with a line in the middle that represents the median.
Whiskers — Lines extending from both ends of the box (minimum/maximum whisker values are calculated as Q1/Q3 -/+ 1.5 * IQR)
Let’s use a boxplot to explore how the continuous variable result is distributed in the autism dataset.
in the aesthetic mapping we specify only x (continuous variable)
switch to vertical orientation with coord_flip()
autism_pids %>%ggplot(aes(x = result )) +geom_boxplot(alpha=0.5)+# switch to vertical orientationcoord_flip() +theme_bw()
Boxplot example 1
Boxplot example 2
Let’s also explore how the continuous variable result is distributed by the categorical variable (factor) ethnicity.
in the aesthetic mapping we specify y (continuous variable), plus x and fill (categorical variable)
make x axis labels readable with theme(axis.text.x (...)) layer
I specify colors that I had previously saved in a color palette contrast_cols_palette
autism_pids %>%ggplot(aes(x = ethnicity, y= result, fill = ethnicity)) +geom_boxplot(alpha=0.5)+scale_fill_manual(values = contrast_cols_palette) +theme_bw()+# make x axis labes readabletheme(axis.text.x =element_text(angle=50, vjust=0.75)) +# drop legend and Y-axis titletheme(legend.position ="none")
Boxplot example 2
Violin plot
Similarly, the violin plot is an interesting alternative to show the distribution of a continuous variable along one or more categorical variables. Here, the kernel density plot shows the smoothed curve of the probability density function (PDF) of the data.
Compared to the box plot, a violin plot provides more information, as it shows not only the summary statistics but also the shape and variability of the data (i.e. helping to identify any skewness or multimodality in the data).
Violin plot example
it requires the geom_violin function
it can be enriched by adding with other geoms, such as points, lines, or box plots, to create more complex and informative plots
let’s add points with the geom_point layer
autism_pids %>%ggplot(mapping =aes(y = age_inputed, x = ethnicity, fill = ethnicity)) +geom_violin(alpha=0.5) +geom_point(position =position_jitter(width =0.1), size =0.5)+scale_fill_manual(values = contrast_cols_palette) +# make x axis labes readabletheme(axis.text.x =element_text(angle=50, vjust=0.75)) +# drop legend and Y-axis titletheme(legend.position ="none")
Violin plot example
SAVING & EXPORTING OUTPUT ARTIFACTS
Saving one plot
If I want to use these output files later, I can easily save in the output folder created at the beginning.